Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New NBFT initramfs module #2620

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

tbzatek
Copy link
Contributor

@tbzatek tbzatek commented Dec 13, 2024

Cc: @mwilck
-- Please check the included README.md first

This started as an experiment from an idea coming from our inhouse dracut and systemd developers. I kinda like the result even though there are many loose ends - read on. There's an unpublished internal plan to slim dracut down in the mid-term and while I can't speak of the details one major argument was something like "dracut was born before systemd" and "systemd nowadays can take over many tasks dracut is doing". Add embedded, automotive and otherwise resource-restricted environments and the push for minimization grows further.

This experiment was about creating specially crafted systemd units injected at the right place with as little dracut involvement as possible. I admit all this work may speak itself a little of a NIH syndrome, still I hope it becomes useful.

There are some loose ends here:

  • This is all based around NetworkManager with the hope to spark an interest of other distributors adding support for more network management frameworks.
  • Need to find a way how to play nice with the existing 95nvmf dracut module. Haven't approached dracut upstream yet.
  • Interface renaming is the thing. Originally planned to be part of the NetworkManager NBFT patch but NM folks wanted to keep their initramfs clean and minimal (for a good reason). So this landed as an extra nbft plugin command, at least for now. We can move this somewhere else as the only entry point is the systemd service here. Also the commandline arguments and help strings may need some tweaks. It was strongly suggested to use systemd link files instead of udev rules for interface renaming. However during my experiments with VLANs this appears to have some drawbacks, so this is still undecided and may opt for udev rules instead. TBD.
  • I have proposed slightly different nbft interface naming, geared towards predictable naming scheme.
  • Nothing is set in stone yet - the design, file and function naming can change.
  • As NetworkManager is a new consumer of the libnvme nbft API, we need to be more careful with future API additions and changes.

Needs #2614 (already merged).
This work also gives answer for #2179 and includes the files we've kept downstream so far.

And some more wild ideas:

  • There is a growing interest to split the libnvme NBFT parser in a separate, independent library (just like the -mi one) that would be small and suitable for initramfs, or statically linked. One suggestion was to have native udev support for nbft interface naming within builtin-net_id.c however the parser dependencies were a problem. The ACPI NBFT table is not a fixed-size struct and is not so trivial to parse if somebody wanted to reimplement the parser just to extract the HFI descriptors.

Intended to create udev network link files (see systemd.link(5))
from ACPI NBFT tables present in the system, typically called
during early boot process.

A semi-predictable interface naming is proposed to accurately
identify HFI that the interface is set up for. The syntax is as
follows: nbftXhY where X is the NBFT table index (filename, defaults
to 0) and Y is the HFI record index.

Signed-off-by: Tomas Bzatek <[email protected]>
This is a minimal dracut module providing Boot from NVMe/TCP
functionality. Consisting of two simple systemd units to set up
networking and perform actual connections afterwards.

Requires the NetworkManager nm-initrd-generator NBFT support.

Signed-off-by: Tomas Bzatek <[email protected]>
In case the `nvme connect-all --nbft` call fails, respawn the service
after 10 seconds and try again. Depends on nvme-cli to return non-zero
status in case of any of the SSNS records that are not marked
as 'unavailable' fails to connect.

The respawn cycle is broken by stopping the unit once rootfs
is mounted and system switches the root.

Signed-off-by: Tomas Bzatek <[email protected]>
@tbzatek tbzatek force-pushed the nbft-boot-refactor-1 branch from 5d23b62 to 3fdb4b9 Compare December 13, 2024 16:34
@mwilck
Copy link
Contributor

mwilck commented Dec 13, 2024

Very interesting. The timing is kind of unfortunate for me. I'll not be able to do an in-depth review until the beginning of next year.

Some quick questions / remarks. I've only glanced about this quickly, so I hope I'm not getting it all wrong.

  • Why do you submit this here rather than for dracut itself?
  • Am I assuming correctly that this module can't coexist with the existing dracut 95nvmf module?
  • Given that this is nvme-cli upstream, I find it questionable to include code that's solely focused on NetworkManager. The current dracut module is agnostic of the network management tool. I would prefer an implementation-agnostic approach here in nvme-cli, too. Simply putting the responsibility for that on "other distributions" isn't enough. Perhaps you could think of some sort of API that can be served by the nm nbft plugin, but alternatively by other plugins, too. Note the "wicked" already has an nbft extension, which is a simple shell script that prints interface configuration in an XML format that wicked understands.
  • I'm not too happy about the new interface naming scheme, given that we already introduced nbft$N. I'm not sure how important it is to have predictability. As you said yourself in the README, NBFT variables can change between reboots, by design. What matters is that the NVMe targets are found, and that it's obvious that the related interfaces have NBFT (aka firmware) origin, not necessarily which HFI they correspond to. There should at least be an option to fall back to the previous convention.
  • About "interface renaming is the thing" – was anything wrong with the way this was done in the existing dracut module?

@mwilck
Copy link
Contributor

mwilck commented Dec 13, 2024

In my mind, the current "API" of nvme-cli wrt NBFT is the JSON format. I am aware that that's subobtimal if you want to build a minimal initramfs, as it requires either a JSON library or something like jq, which is rather big. Still, discuss carefully if we want replace this by something else.

@igaw
Copy link
Collaborator

igaw commented Dec 16, 2024

One thing I'd like to bring up, is that I think we should reconsider the command-all command. The implementation is very complex, hard to maintain and easy to break. And I have the impression the use case which it suppose to cover are contradicting themself.

Almost all other commands by nvme-cli are just simple commands which do one thing. I think it should be possible build any complex logic ontop of it, e.g. using scripts.

My goal would be to move the missing parts of discover and the connect logic into the library, and
nvme-cli does just the command line parsing etc.

This is obviously all v3 material.

@tbzatek
Copy link
Contributor Author

tbzatek commented Jan 7, 2025

Thanks for your first comments, Martin. There's a little more background for this work. I'll try to explain but take it with a grain
of salt as nothing has been decided officially yet - it's just a plan that's being discussed widely inhouse.

There's a desire (Fedora, RHEL) to gradually slim dracut down mid-term or even phase it out in the long-term. There's an ambition to use projects like mkosi to minimize differences between initramfs and OS/rootfs environment.

This pull request is a first step towards this goal - minimizing dracut involvement in the boot process. Dracut would stay just as a tool to build the initramfs image, copying required files in place and not much more. Then it would be mostly a matter of changing systemd units and targets dependencies when porting to a different initramfs framework. This should answer some of your questions about "why not dracut".

Related to that, there's a tendency to move the particular initramfs functionality towards their native packages and transfer maintenance responsibility along with it. Updating a specific dracut module in a distro and the associated QA cost is often much higher than keeping the module at the edge, i.e. together with the required binaries and libraries.

Perhaps this work should be called an initramfs module instead.

@tbzatek tbzatek changed the title New dracut 95nbft module New initramfs module Jan 7, 2025
@tbzatek tbzatek changed the title New initramfs module New NBFT initramfs module Jan 7, 2025
@mwilck
Copy link
Contributor

mwilck commented Jan 16, 2025

There's a desire (Fedora, RHEL) to gradually slim dracut down mid-term

Understood. Similar here.

Even now, dracut has no monopoly on building initramfs images. mkosi is relatively new on the stage, but Debian's update-initramfs has existed for years. We've actually discussed about it in the Timberland group, but there didn't seem to be anyone from the Debian/Ubuntu realm interested in NVMeoF-boot. That's why only dracut is currently supported.

Dracut would stay just as a tool to build the initramfs image

Hm, that's exactly what it is today, no?

there's a tendency to move the particular initramfs functionality towards their native packages

That makes some sense.

The nvme-cli maintainers will need to decide whether they want to take the responsibility of maintaining the dracut module. It's tricky because the module interfaces intimately with both nvme-cli and dracut.

But I should look at the code first...

# /usr/lib/NetworkManager/dispatcher.d/99-nvme-nbft-connect.sh hook
# will respawn nvmf-connect-nbft.service on such occasion.

[device-nbft-no-ignore-carrier]
Copy link
Contributor

@mwilck mwilck Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about this config section. Will you implement support for this section, and the directives it contains, in NetworkManager? If not, how is this supposed to work?

Copy link
Contributor

@mwilck mwilck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code quality is eccelent, AFAICS. I also like the design. For me, the open question is mostly which part of this code actually belongs into nvme-cli.

I'd prefer if we could separate out the NetworkManager-specific part more. So that we'd have obvious ways to extend this to other network management tools like wicked.

I believe that libnvme/nvme-cli shouldn't come with a preference for one specific network management tool, even if NM is the "de facto standard" these days.

Likewise, it would be nice not to depend on dracut directly and to just include the dracut code as a PoC, showing how the systemd services need to be complemented in the initramfs. It would be even nicer if we had a mkosi PoC as well.

Wrt dracut specifically, it's up to the nmve-cli / libnvme maintainers to tell whether they are willing to maintain a dracut module.

Two major packages are responsible for this: the new nvme-cli dracut module and
the added NBFT support in NetworkManager.

## The new dracut 95nbft module
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: nothing ages so quickly as the word "new" ...

dracut to activate networking
* dracut runs `nm-initrd-generator` and starts the NetworkManager daemon
* `systemd-udev-trigger.service` renames the network interfaces
* `nm-wait-online-initrd.service` finishes, indicating networking is up and ready
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my remark below about nbft-boot-connect.service. I am somewhat confused about when exactly network-online.target is considered to be reached.


For simplicity and for the time being this systemd unit replaces the traditional
dracut cmdline hook and adds the `rd.neednet=1` `cmdline.d` argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly do you mean with "traditional dracut cmdline hook" here? How can you replace it?

## nm-initrd-generator NBFT support

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/2077

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the most important difference wrt 95nvmf. You create a tool that directly transforms libnvme NBFT data structures into the preferrred format of your Network configuration tool (NM), whereas 95nvmf converts NBFT to JSON first, then JSON to dracut-style command line arguments, which would then further processed by the conventional nm-initrd-generator.

Your approach is of course much more efficient, but at the cost of being compatible only with (a future version of) NM.

A similar approach could be taken by wicked or other network management tools.

But I wonder if there might be some middle ground, perhaps we can provide the HFI data in some format that any network management tool can easily convert?

The "dracut command line" format is obviously very clumsy and simplistic.
I'd love to see dracut develop a more capable generic format for describing networking parameters. But I doubt that's going to happen. After all, dracut's format was designed with the kernel command line in mind.

So we could keep 95nvmf for those network management tools that have no native NBFT support, and take your approach for others. I'd need to discuss with the wicked people if it makes sense for us to write a similar plugin.

attempts are not driven by network link up events but have fixed respawn
interval. This may potentially help the cases where the NIC is slow to
initialize, reports link up yet it takes another 5+ seconds before it's fully
able to send/receive packets. We've seen this issue with some 25Gb NICs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

Ideally we wouldn't react on "link up" events but on events that indicate an L3 connection. But I'm not sure if such events exist... (see below)

Note that 95nvmf also has a timeout action. But repeating this in regular intervals is actually a nice idea.

@@ -571,3 +571,108 @@ int show_nbft(int argc, char **argv, struct command *cmd, struct plugin *plugin)
}
return ret;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So here you add support for one specific network configuration file format, udev's .link files. It would be possible to derive the .link files from the JSON output, but this is obviously much faster and more light-weight, especially compared to using jq for the conversion in the current code.

Yet it's not immediately obvious why we'd implement this in nvme-cli, while the conversion of HFI data to .nmconnection files is done in the Networkmanager plugin. Instead of adding this functionality to libnvme, you might as well have written a separate tool that links to libnvme and creates the .link files. Why do this here and that there?

The boot process looks roughly as follows:
* `nbft-boot-pre.service` is run, creates udev network link files and tells
dracut to activate networking
* dracut runs `nm-initrd-generator` and starts the NetworkManager daemon
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dracut runs nm-initrd-generator

I fail to see where this happens. Is it an ExecStartPre= in NetworkManager.service?

match-device=interface-name:nbft*

# react on link up/down events
ignore-carrier=no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more intuitive to add globbing support to NetworkManager's "Device List" format, and then simply set something like

[main]
ignore-carrier=except:interface-name:nbft*

Anyway, this is Fedora/RHEL specific. Not sure if other distibutions ship tje Networkmanager-config-server package. On the libvme side, I'd like to keep the amount of NetworkManager-specific code as minimal as possible, thus I have my doubts whether this belongs here.

I'd prefer if you find some other way to set this up on Fedora. Perhaps simply by documenting that nbft devices should be excempted from the ignore_carrier= list in NetworkManager.conf? Or by simply changing the config file that NetworkManager-config-server ships?

the required block device appears, the wait cycle is ended and the system
continues booting, stopping any queued `nbft-boot-connect.service` respawns
seamlessly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see below that nbft-boot-connect.service contains After=network-online.target. Thus no NBFT target will be connected before this target is reached. Maybe I misunderstand when you consider "online" state to be reached. Is it already reached when just one iBFT interface is configured?

* `nm-wait-online-initrd.service` finishes, indicating networking is up and ready
* `nbft-boot-connect.service` initiates actual NVMe connections
* the dracut initqueue is waiting for specific block devices (rootfs) to appear

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this, and the starting of NM above, are the only 2 things that dracut needs to perform, and therefore it's relatively easy to plug this scheme into mkosi or another initramfs generator, as long as that generator is based on systemd. Nice.

Am I understanding correctly?

@igaw
Copy link
Collaborator

igaw commented Jan 17, 2025

The nvme-cli maintainers will need to decide whether they want to take the responsibility of maintaining the dracut module. It's tricky because the module interfaces intimately with both nvme-cli and dracut.

I don't have any problems if this is added BUT I'd really want to see some sort of CI testing for this. This stuff will bit rot immediately after merging if we don't have regular upstream testing.

@mwilck
Copy link
Contributor

mwilck commented Jan 17, 2025

I don't have any problems if this is added BUT I'd really want to see some sort of CI testing for this. This stuff will bit rot immediately after merging if we don't have regular upstream testing.

That's the point, it's hard to write meaningful tests for this. It requires simulating at least a client/server setup with appropriately crafted BIOS settings. While there are client/server tests in the dracut test suite, I'm not aware of a test for NVMeoF/NBFT.

@igaw
Copy link
Collaborator

igaw commented Jan 17, 2025

We have a bunch of exising mock tests. Could we turn this into something more?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants